R Executive summary

Basic statistics

Size (rows, columns) and summary statistics of each dataset after removing null values:

Colors dataset:

## [1] 263   4
##        id             name               rgb              is_trans        
##  Min.   :  -1.0   Length:263         Length:263         Length:263        
##  1st Qu.:  83.0   Class :character   Class :character   Class :character  
##  Median :1005.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 651.4                                                           
##  3rd Qu.:1070.5                                                           
##  Max.   :9999.0

Elements dataset:

## [1] 60456     4
##    element_id        part_num            color_id        design_id     
##  Min.   :   9327   Length:60456       Min.   :  -1.0   Min.   :  1001  
##  1st Qu.:4565425   Class :character   1st Qu.:  10.0   1st Qu.: 18454  
##  Median :6111350   Mode  :character   Median :  28.0   Median : 41748  
##  Mean   :5517587                      Mean   : 120.4   Mean   : 45570  
##  3rd Qu.:6286413                      3rd Qu.:  85.0   3rd Qu.: 75474  
##  Max.   :6499141                      Max.   :9999.0   Max.   :107520

Inventories dataset:

## [1] 37265     3
##        id            version         set_num         
##  Min.   :     1   Min.   : 1.000   Length:37265      
##  1st Qu.: 14424   1st Qu.: 1.000   Class :character  
##  Median : 54379   Median : 1.000   Mode  :character  
##  Mean   : 61104   Mean   : 1.091                     
##  3rd Qu.: 88842   3rd Qu.: 1.000                     
##  Max.   :194312   Max.   :16.000

Inventory minifigs dataset:

## [1] 20858     3
##   inventory_id      fig_num             quantity      
##  Min.   :     3   Length:20858       Min.   :  1.000  
##  1st Qu.:  7869   Class :character   1st Qu.:  1.000  
##  Median : 15681   Mode  :character   Median :  1.000  
##  Mean   : 43010                      Mean   :  1.062  
##  3rd Qu.: 66834                      3rd Qu.:  1.000  
##  Max.   :194312                      Max.   :100.000

Inventory parts dataset:

## [1] 1180987       6
##   inventory_id      part_num            color_id         quantity      
##  Min.   :     1   Length:1180987     Min.   :  -1.0   Min.   :   1.00  
##  1st Qu.:  9404   Class :character   1st Qu.:   4.0   1st Qu.:   1.00  
##  Median : 22838   Mode  :character   Median :  15.0   Median :   2.00  
##  Mean   : 50849                      Mean   : 131.8   Mean   :   3.37  
##  3rd Qu.: 87088                      3rd Qu.:  71.0   3rd Qu.:   4.00  
##  Max.   :194312                      Max.   :9999.0   Max.   :3064.00  
##    is_spare           img_url         
##  Length:1180987     Length:1180987    
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

Inventory sets dataset:

## [1] 4358    3
##   inventory_id      set_num             quantity     
##  Min.   :    35   Length:4358        Min.   : 1.000  
##  1st Qu.:  8076   Class :character   1st Qu.: 1.000  
##  Median : 16423   Mode  :character   Median : 1.000  
##  Mean   : 52519                      Mean   : 1.813  
##  3rd Qu.: 98685                      3rd Qu.: 1.000  
##  Max.   :191576                      Max.   :60.000

Minifigs dataset:

## [1] 13764     4
##    fig_num              name             num_parts         img_url         
##  Length:13764       Length:13764       Min.   :  0.000   Length:13764      
##  Class :character   Class :character   1st Qu.:  4.000   Class :character  
##  Mode  :character   Mode  :character   Median :  4.000   Mode  :character  
##                                        Mean   :  5.296                     
##                                        3rd Qu.:  5.000                     
##                                        Max.   :156.000

Part categories dataset:

## [1] 66  2
##        id            name          
##  Min.   : 1.00   Length:66         
##  1st Qu.:19.25   Class :character  
##  Median :35.50   Mode  :character  
##  Mean   :35.36                     
##  3rd Qu.:51.75                     
##  Max.   :68.00

Part relationships dataset:

## [1] 29977     3
##    rel_type         child_part_num     parent_part_num   
##  Length:29977       Length:29977       Length:29977      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character

Parts dataset:

## [1] 52615     4
##    part_num             name            part_cat_id    part_material     
##  Length:52615       Length:52615       Min.   : 1.00   Length:52615      
##  Class :character   Class :character   1st Qu.:17.00   Class :character  
##  Mode  :character   Mode  :character   Median :41.00   Mode  :character  
##                                        Mean   :38.91                     
##                                        3rd Qu.:60.00                     
##                                        Max.   :68.00

Sets dataset:

## [1] 21880     6
##    set_num              name                year         theme_id  
##  Length:21880       Length:21880       Min.   :1949   Min.   :  1  
##  Class :character   Class :character   1st Qu.:2001   1st Qu.:273  
##  Mode  :character   Mode  :character   Median :2012   Median :497  
##                                        Mean   :2008   Mean   :442  
##                                        3rd Qu.:2018   3rd Qu.:608  
##                                        Max.   :2024   Max.   :752  
##    num_parts         img_url         
##  Min.   :    0.0   Length:21880      
##  1st Qu.:    3.0   Class :character  
##  Median :   31.0   Mode  :character  
##  Mean   :  161.4                     
##  3rd Qu.:  139.0                     
##  Max.   :11695.0

Themes dataset:

## [1] 323   3
##        id            name             parent_id    
##  Min.   :  3.0   Length:323         Min.   :  1.0  
##  1st Qu.:205.0   Class :character   1st Qu.:186.0  
##  Median :469.0   Mode  :character   Median :411.0  
##  Mean   :419.9                      Mean   :360.6  
##  3rd Qu.:632.5                      3rd Qu.:512.5  
##  Max.   :751.0                      Max.   :697.0
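
The size and summary outputs above could be reproduced with a sketch like the following. It is an illustration only: the toy `colors` data frame stands in for a table that would normally be read from a CSV file, and using `na.omit()` to drop incomplete rows is an assumption about how null values were removed.

```r
# Toy stand-in for one table (values are illustrative, not the real data).
colors <- data.frame(
  id = c(-1, 0, 1, NA),
  name = c("[Unknown]", "Black", "Blue", "Red"),
  rgb = c("0033B2", "05131D", "0055BF", NA),
  is_trans = c("f", "f", "f", "f"),
  stringsAsFactors = FALSE
)

# Drop every row that contains at least one missing value (assumed approach).
colors_clean <- na.omit(colors)

# Report the size (rows, columns) and the per-column summary statistics.
print(dim(colors_clean))
print(summary(colors_clean))
```

Character columns are summarised only by length, class, and mode, which is why columns such as `name` and `rgb` show no quartiles in the outputs above.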

Attributes analysis

Number of sets released over years

Top 10 years with most sets released based on inventory sets quantity

Color variety in sets over the years

Conclusions:

The plot shows a clear upward trend in the average number of unique colors per Lego set from the 1950s to the present. The increase steepens in the early 2000s, when the average rises faster than in previous decades.

This trend may reflect Lego’s strategy of making sets more varied and appealing, perhaps in response to market demand for more intricate and visually stimulating products.
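
The average number of unique colors per set and year discussed above could be computed along these lines. This is a minimal sketch in base R under stated assumptions: the three toy data frames only borrow column names from the summaries (`set_num`, `year`, `inventory_id`, `color_id`), their values are invented, and the report's actual pipeline may differ (for example, in how multiple inventory versions of one set are handled).

```r
# Toy tables; only the column names mirror the real datasets.
sets <- data.frame(set_num = c("A-1", "B-1", "C-1"),
                   year = c(1990, 1990, 2020),
                   stringsAsFactors = FALSE)
inventories <- data.frame(id = c(1, 2, 3),
                          set_num = c("A-1", "B-1", "C-1"),
                          stringsAsFactors = FALSE)
inventory_parts <- data.frame(inventory_id = c(1, 1, 1, 2, 3, 3, 3, 3),
                              color_id = c(0, 0, 4, 0, 1, 4, 15, 72))

# Attach the set number to every inventory part.
parts_sets <- merge(inventory_parts, inventories,
                    by.x = "inventory_id", by.y = "id")

# Count the distinct colors used in each set.
colors_per_set <- aggregate(color_id ~ set_num, data = parts_sets,
                            FUN = function(x) length(unique(x)))
names(colors_per_set)[2] <- "n_colors"

# Attach the release year and average the color counts per year.
colors_per_set <- merge(colors_per_set, sets[, c("set_num", "year")],
                        by = "set_num")
avg_colors <- aggregate(n_colors ~ year, data = colors_per_set, FUN = mean)
print(avg_colors)
```

In the toy data the two 1990 sets use 2 and 1 colors (mean 1.5) and the 2020 set uses 4, so the per-year averages rise in the same direction as the trend described above.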

Distribution of the number of minifigures per set: the 10 sets with the most minifigures

Number of minifigures included per set over time

Conclusions:

Distinct spikes appear in certain years, which could indicate special editions or series of sets that included more minifigures, or a general increase in the inclusion of minifigures during those periods. Each spike is often followed by a drop, which may suggest a return to the norm.

After 2010, there appears to be a downward trend, suggesting that recent sets might be including fewer minifigures on average.
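
A per-year average of minifigures per set, as discussed above, could be derived roughly as follows. The sketch assumes the `inventory_minifigs`, `inventories`, and `sets` column names shown in the summaries; the values are invented, and treating the plotted quantity as a mean over sets is an assumption.

```r
# Toy tables; only the column names mirror the real datasets.
sets <- data.frame(set_num = c("A-1", "B-1", "C-1"),
                   year = c(2009, 2009, 2015),
                   stringsAsFactors = FALSE)
inventories <- data.frame(id = c(1, 2, 3),
                          set_num = c("A-1", "B-1", "C-1"),
                          stringsAsFactors = FALSE)
inventory_minifigs <- data.frame(inventory_id = c(1, 1, 2, 3),
                                 fig_num = c("fig-1", "fig-2", "fig-1", "fig-3"),
                                 quantity = c(1, 2, 1, 1),
                                 stringsAsFactors = FALSE)

# Map each minifigure row to its set.
figs <- merge(inventory_minifigs, inventories,
              by.x = "inventory_id", by.y = "id")

# Total minifigures per set (summing quantities, not counting rows).
figs_per_set <- aggregate(quantity ~ set_num, data = figs, FUN = sum)

# Attach the release year and average over the sets of each year.
figs_per_set <- merge(figs_per_set, sets, by = "set_num")
avg_figs <- aggregate(quantity ~ year, data = figs_per_set, FUN = mean)
print(avg_figs)
```

Because the joins are inner joins, sets without any minifigures are left out of the average; including them as zeros would lower the yearly values.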

Variable correlations

Correlation between size of a set and number of colors in a set

Conclusion:

The heatmap suggests a positive correlation between the size of a Lego set (measured by the number of parts) and its color diversity (measured by the number of unique colors). Sets with more parts also tend to use more distinct colors.
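
The strength of such a relationship can be quantified with `cor()`. The sketch below uses a hypothetical per-set table `set_stats` with invented values; `n_colors` is an assumed column name for the number of unique colors per set. Reporting Spearman's rank correlation alongside Pearson's is a suggestion here, not something the report states it did.

```r
# Hypothetical per-set table: part count and number of unique colors.
set_stats <- data.frame(
  num_parts = c(10, 50, 120, 400, 1500),
  n_colors  = c(2, 5, 8, 12, 20)
)

# Pearson correlation (the default) measures linear association.
r <- cor(set_stats$num_parts, set_stats$n_colors)

# Spearman correlation measures monotonic association and is less
# sensitive to a few very large sets.
rho <- cor(set_stats$num_parts, set_stats$n_colors, method = "spearman")

print(c(pearson = r, spearman = rho))
```

With heavily skewed part counts, a rank-based coefficient can be a useful cross-check on the picture given by a heatmap.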

Correlation between the number of parts in a set and the complexity of a set

The complexity of a set can be approximated by the number of unique part categories used in it.

## `geom_smooth()` using formula = 'y ~ x'

## [1] 0.5391932

Conclusion:

The scatter plot reveals a positive correlation between the number of parts and set complexity. As the number of parts in a set increases, the number of unique part categories tends to increase as well, suggesting that larger sets are generally more complex. This relationship seems to hold strongly for sets with a smaller number of parts, as indicated by the dense cluster of points toward the origin, where the increase in complexity with the number of parts is quite pronounced.

For sets with a very high number of parts (toward the right end of the X-axis), the data points become more spread out, indicating more variability in complexity among these larger sets. This suggests that once a set reaches a certain size, adding more parts does not necessarily increase complexity at the same rate. This could be due to repeated parts within these large sets or a design choice not to increase complexity despite a higher part count.
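
The complexity measure described above (unique part categories per set) and its correlation with set size could be computed as in this sketch. It is illustrative only: the toy tables reuse column names from the summaries (`part_num`, `part_cat_id`, `inventory_id`, `set_num`, `num_parts`), the values are invented, and `n_categories` is a hypothetical column name.

```r
# Toy tables; only the column names mirror the real datasets.
parts <- data.frame(part_num = c("3001", "3002", "3003", "3004"),
                    part_cat_id = c(11, 11, 14, 27),
                    stringsAsFactors = FALSE)
inventories <- data.frame(id = c(1, 2, 3),
                          set_num = c("A-1", "B-1", "C-1"),
                          stringsAsFactors = FALSE)
sets <- data.frame(set_num = c("A-1", "B-1", "C-1"),
                   num_parts = c(5, 40, 300),
                   stringsAsFactors = FALSE)
inventory_parts <- data.frame(
  inventory_id = c(1, 2, 2, 2, 3, 3, 3, 3),
  part_num = c("3001", "3001", "3002", "3003",
               "3001", "3002", "3003", "3004"),
  stringsAsFactors = FALSE
)

# Look up the category of every part, then the set it belongs to.
ip <- merge(inventory_parts, parts, by = "part_num")
ip <- merge(ip, inventories, by.x = "inventory_id", by.y = "id")

# Complexity proxy: number of distinct part categories in each set.
complexity <- aggregate(part_cat_id ~ set_num, data = ip,
                        FUN = function(x) length(unique(x)))
names(complexity)[2] <- "n_categories"

# Combine with the part counts and measure the association.
set_stats <- merge(sets, complexity, by = "set_num")
print(set_stats)
print(cor(set_stats$num_parts, set_stats$n_categories))
```

Since the number of part categories is bounded (the summaries show 66 categories), this proxy must level off for very large sets, which is consistent with the wider spread noted above.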